Main goals of the session

  1. Retrieve sequences from databases and build multiple sequence alignments.
  2. Estimate species divergence at different functional gene regions
  3. Calculate and compare evolutionary rates of two proteins with different functions.

1. Practicals organization

To put into practice the concepts of species divergence and evolutionary rates that you will learn in the theory lessons, you will analyse nucleotide divergence in two genes (obp83b and rpL32) with different biological functions in five closely related species of the genus Drosophila.

Throughout the document you will see different icons whose meaning is:

: Additional or useful information

: Practical exercise

: Hint to solve an exercise or to do a task

: Slot to answer a question

: Problem or task to be solved


2. Installing R packages

You can use either the R console in the terminal or RStudio for this practice. If you don’t have R installed, you can download the appropriate package for your system here. To install RStudio, go to this page and follow the instructions.

Before starting the exercises, you will need to install some necessary libraries for downloading sequences from the GenBank database, performing multiple sequence alignments, building phylogenetic trees, and calibrating these trees. Open the R console in the terminal (or in RStudio) and type:

 packages <- c("reutils", "ape", "seqinr", "phangorn", "phylotools")
 install.packages(setdiff(packages, rownames(installed.packages())))

3. Retrieving sequence data

You can use the efetch() function implemented in the reutils package to retrieve sequences from databases. These functions allow you to connect to GenBank and download the sequences directly to your computer in FASTA format. You will need the GenBank identifier of either the gene region or the complete genome sequence of the species and the coordinates of the desired gene region in that genome. Note that for some species the gene is encoded in the reverse complement sequence relative to the reference in the database. In these cases you will need to specify the strand in efetch. In the example below

Let’s see how to retrieve the coding regions (CDS) of the obp83b gene in different species. Table 1 lists the identifiers and coordinates of this gene region in the five species:


Table 1. Identifiers and genomic coordinates of the obp83b gene in different Drosophila species.

Species ID Coordinates in the genome Strand
D. melanogaster NT_033777 5976177-5976817 2
D. simulans NC_052523 2470507-2471257 2
D. erecta NW_020825194 26500029-26500718 1
D. elegans NW_024545863 12746083-12746802 2
D. pseudopobscura NC_046679 16336491-16337172 1



GenBank_ identifiers can be obtained from the literature (articles) by starting a search with R libraries such as rentrez, directly from the web page of this database, using esearch() from reutils, or using other programmes and more advanced bioinformatics tools. .

To download and write a file in FASTA format with the CDS sequence of the obp83b gene in these species, you can use the efetch() function of reutils.

 library(reutils)

 ## create a new working directory
 dir.create("divergence")

 ## example for D. subobscura:
 efetch("NC_046679", db="nucleotide", rettype="fasta_cds_na", retmode="text", strand=1, seqstart=16336491, seqstop=16337172, outfile="./divergence/Dpse_obp83b_cds.fasta")

db, specifies the database; rettype, specifies the subsequence to be extracted (=CDS) and the format (=FASTA); retmode, specifies the file format (=plain text); strand, 1: positive strand, 2: negative (reverse complement with respect to the reference in the database); seqstart and seqstop, are the coordinates of the sequence to be downloaded (required if we are extracting gene regions from the complete genome sequence set).


Exercise 1

  1. Download the rest of CDS sequences of the obp83b gene in the “divergence” folder and name the files as in the example (starting with a three-letter code identifying the species).
  2. Create a new file (e.g. “obp83b_cds.fasta”) with the CDS of the five species in fasta format (multi-fasta file).
    You can do this manually or, for example, use the read.dna() and write.dna() functions in the ape package.

4. Multiple sequence alignment

To identify the nucleotide substitutions that have occurred in the coding region of the opb83b gene during the divergence of the four species, one must first predict which positions are homologous (i.e. derived from a common ancestor). There are many algorithms and programs for constructing multiple sequence alignments of DNA sequences. For purely practical reasons, in this session you will use ape, an R package that implements some of these algorithms.

 library(ape)

 ## read the sequences
 myseqs<-read.dna("divergence/obp83b_cds.fasta", format="fasta")

 ## visualize the sequences
 myseqs

 ## multiple sequence alignment
  myalign <- clustal(myseqs)

 ## partial visualization of the generated alignment
 myalign

 ## write a file (maintaining the gaps introduced in the msa step)
 write.dna(myalign, file="divergence/obp83b_cds_aln.fasta", format="fasta")

5. Divergence at synonymous and nonsynonymous sites

Coding regions are special for molecular evolutionary analyses because they contain two different types of positions in terms of the functional consequences of mutations, namely synonymous and nonsynonymous positions. Synonymous changes are those that do not alter the amino acid encoded by the codon, whereas nonsynonymous changes cause an amino acid substitution in the protein. Consequently, synonymous sites are sites where if a change occurs, it will be a synonymous change. The same applies to nonsynonymous sites. Studying and comparing the variability of these two types of sites is very informative about selective constraints and the functional consequences of mutations.

Divergence is estimated as the number of substitutions per site (this allows the divergence of regions of different lengths to be compared). Therefore, to estimate synonymous and nonsynonymous divergence, we first need to calculate the number of sites of each type in the coding regions of interest. The most commonly used method to estimate these sites is the Nei-Gojobori (N-G) method (Figure 1). The N-G approach consists of estimating the proportion of changes that would be synonymous and nonsynonymous in a given codon after considering all nine possible nucleotide changes (based on the genetic code).


Figure 1. Nei-Gojobori method for estimating the number of synonymous and nonsynonymous sites in a codon


Here are some examples of functions for working with changes and synonymous sites:

 library(seqinr)

 translate<-seqinr::translate
 
 ## function to translate a codon:
 dna_to_aa <- function(codon) {
   dna<-s2c(codon)
   aa<-translate(dna, frame = 0, sens = "F", numcode = 1, NAstring = "X", ambiguous = FALSE)
   return (aa)
 }  

 ## function to determine if a change between two codons is synonym
 is_synonymous<-function(codon1, codon2) {
   return (dna_to_aa(codon1) == dna_to_aa(codon2))
 }

 ## function to estimate the number of synonymous sites in a codon:
 synpos<-function(codon) {
   pos<-s2c(codon)
   for (i in 1:length(pos)) {
        base = pos[i]
        BASES = c('A', 'G', 'T', 'C')
        other_bases = BASES[BASES!=base]
        syn=0
          for (new_base in other_bases) {
               new_pos=c(pos[0:(i-1)], new_base, pos[-(1:i)])
               new_codon=paste(new_pos, collapse="")
             s<-(is_synonymous(codon, new_codon))
               syn <- syn + length(s[s==TRUE])
          }
        synp <- syn/3
   }
 return (synp)
 }

Exercise 2

Calculate the number of synonymous and nonsynonymous changes and sites in this coding sequence alignment using the functions above:

Seq 1 ATGATGCAGAGTCTGTAA
Seq 2 ATGAGGCACAGTCTGTAA

The number of sites of each class in each sequence is the sum over all codons in the sequence, and the total number of sites of each class in the alignment is the average over all sequences. You can apply the functions separately for each codon, or consider a loop integrating functions such as s2c() and splitseq() from the seqinr package


Fortunately, you will not need to repeat this calculation for all the gene sequences we will be studying in this lab. There are R functions that can be used to calculate the synonymous and nonsynonymous divergence of a coding region directly from the CDS alignment. The seqinr package implements the KaKs() function, which allows the calculation of these divergences. Before proceeding with the calculations, we will change the sequence names in the fasta file to make it easier to identify the species in the results matrix (now the identifiers are very long and some of them do not contain the species name):

 library(phylotools)
 
 old_names<-get.fasta.name("divergence/obp83b_cds_aln.fasta")
 ## new names must be in the same order than old names...
 new_names<-c("Dmel_obp83b", 
              "Dsim_obp83b", 
              "Dere_obp83b", 
              "Dele_obp83b", 
              "Dpse_obp83b")
 ref2 <- data.frame(old_names, new_names)
 rename.fasta(infile = "divergence/obp83b_cds_aln.fasta", ref_table = ref2, outfile = "divergence/obp83b_cds_renamed_aln.fasta")

 cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")

 ## matrix with synonymous divergences (Ks)
 kaks(cds)$ks

 ## matrix with nonsynonymous divergences (Ka)
 kaks(cds)$ka

6. Divergence at noncoding sites

In addition to the CDS, noncoding sequences are also found in eukaryotic genes. These sequences correspond to introns, exonic untranslated regions (5’ or 3’) or 5’ proximal regions, often containing promoters and regulatory elements. Here, you will analyse the divergence in the 5’ proximal region and the introns of the obp83b gene in the same five species.

First, you will obtain noncoding sequences of this gene for all species. To do this, you need to know exon coordinates. Again, you can use reutils functions to read the GenBank entries in your R terminal:

 ## example for D. pseudoobscura:
 x<-efetch("NC_046679", db="nucleotide", rettype="gb", retmode="text", strand=1, seqstart=16336491, seqstop=16337172)
 con <- content(x, as = "textConnection")
 readLines(con)

The readLines() function prints the entry in GenBank format. Locate the feature “CDS” to find the coordinates of the coding region of the gene obp83b in this species.

Once you know the coordinates of CDS, you can extract 5’ and intron sequences:

 ## example for D.pseudobscura
 ## download the complete gene region
 efetch("NC_046679", db="nucleotide", rettype="fasta", retmode="text", strand=1, seqstart=16336491,      seqstop=16337172, outfile="divergence/Dpse_opb83.fasta")

 ## create an object with the sequence of the complete gene region
 seq<-read.dna("divergence/Dpse_obp83b.fasta", format="fasta")

 ## create a new sequence with only intron sequences
 ## CDS = join(54..128,186..261,317..594)
 ## 5' region = 1-53
 ## intron 1 = 129-185
 ## intron2 = 262-316
 seqnc<-seq
 seqnc=seqnc[,c(1:53,129:185,262:316)]
 
 ## change species label
 seqnc<-updateLabel(seqnc, labels(seqnc), "Dpse_obp83b")

 ## write a fasta file with noncoding sequences of the opb83b gene of ALL SPECIES (with append=TRUE, if your run this script for all species, all sequences will be added to the same file; remember to change the ids in each case)
 write.dna(seqnc, format="fasta", file="divergence/obp83b_nc.fasta", append=TRUE)

Now we can align the non-coding sequences and estimate the number of substitutions per site (KNC) of the obp83b gene. For nucleotide divergences (without any distinction of positions, just total divergence) you can use the function dist.dna() from the package ape.

 ## read the noncoding sequences
 myseqs<-read.dna("divergence/obp83b_nc.fasta", format="fasta")


 ## visualize the sequences
 myseqs

 ## multiple sequence alignment
 myalign <- clustal(myseqs)

 ## partial visualization of the generated alignment
 myalign

 ## write the alignment to file
 write.dna(myalign, file="divergence/obp83b_nc_aln.fasta", format="fasta")
 
 ## read the noncoding alignment
 nc <- read.dna("divergence/obp83b_nc_aln.fasta", format="fasta")

 ## noncoding divergences in the opb83b gene
 dist.dna(nc)

Exercise 3

Fill empty cells in the following tables with the results obtained in the different divergence analyses and answer the questions:

Table 1. Divergence in the coding region of the gene opb83b(Ka and Ks values are above and below the diagonal, respectively)

Dmel Dsim Dyak Dana Dpse
Dmel ———
Dsim ———
Dyak ———
Dana ———
Dpse ———

Divergence in the non-coding regions of the gene opb83b (Fill in the table with KNC values above the diagonal)

Dmel Dsim Dyak Dana Dpse
Dmel ———
Dsim ——— ———
Dyak ——— ——— ———
Dana ——— ——— ——— ——— ———
Dpse ——— ——— ——— ——— ———

Questions:

1. Is the obp83b gene evolving rapidly? And the Obp83b protein?

Answer:

2. In which of the analysed positions (synonymous, nonsynonymous or noncoding) are evolutionary distances higher? What is the reason for this, in your opinion?

Answer:

3. Could you consider some of the observed changes as selectively neutral? Which ones?

Answer:

4. Which species would share a more recent common ancestor? In which evolutionary distances (synonymous, nonsynonymous or noncoding) should we focus to know that?

Answer:


7. Phylogenetic tree reconstruction and tree calibration

Some of the most direct applications of nucleotide divergence estimates are the reconstruction of phylogenetic relationships and the inference of evolutionary rates. To reconstruct the phylogenetic relationships among the five Drosophila species using the information form the obp83b gene, you can use the functions from ape and phangorn packages.

As an example, the following code reconstructs the phylogenetic relationships between the five species based on synonymous divergences. It uses the synonymous divergence matrix estimated above and the Neighbor Joining algorithm

 library(phangorn)
 
 cds <- read.alignment("divergence/obp83b_cds_renamed_aln.fasta",format="fasta")
 
 ## NJ tree
 syntree <- NJ(kaks(cds)$ka)
 plot(syntree)
 write.tree(syntree, file="divergence/obp83b_Ks.tree")

By definition, a Neighbor Joining tree is an unrooted tree. There is no root node that is the common ancestor of all species. With an unrooted tree, we can tell the phylogenetic relationships between species (i.e. which species have a more recent common ancestor compared to the other species), but not the direction of evolution (i.e. which species arose more recently). In addition, branch lengths represent substitutions per site, in this case the number of substitutions per site. Sometimes, as in our example, we know which branch contains the root of the species under study (see Figure 2).

Figure 2. Phylogenetic relationships among the species of the genus Drosophila(modified from Mol Ecol Resour. 2022;22:1559–1581)

To calculate the evolutionary rate of the Obp83b protein, we need an ultrametic tree, i.e. a rooted tree in which the branch lengths represent time (not substitutions) and at least one calibration point (i.e. a dated node).

 ## read the tree to a phylo class object
 tree<-read.tree(file="divergence/obp83b_Ks.tree")

 ## plot the unroted tree
 plot(tree, main="Obp gene tree (number of nonsynonymous substitutions per site)")

 ## root the tree using Drosophila pseudoobscura as the outgroup
 rooted<-root(tree, "Dpse_obp83b")
 plot(rooted)
 add.scale.bar(ask=TRUE)

 ## set calibration
 ## select the node corresponding to the ancestor of D. melanogater, D. simulans and D. erecta, and     specify a range of 6.1-18.9 MYA
 cal <- makeChronosCalib(rooted, interactive = TRUE)

 ## fit the model (penalized likelihood)
 chr<-chronos(rooted, model="clock", calibration = cal)

 # plot de calibrated (ultrametric) tree
 plot(chr, main="calibrated tree (million years)")
 axisPhylo()
 tiplabels()
 nodelabels()
 chr$edge
 chr$edge.length

The tiplabels() and nodelabels() functions will tell you the id of each node in the tree (they are indicated in the tree). With chr$edge and chr$edge.length you can find the correspondence between edge numbers and edge lengths, and thus the estimated date in million years of each node.

You now have all the parameters you need to calculate the evolutionary rate of the Obp83b protein in Drosophila.

Exercise 4

Calculate the evolutionary rate (r) of the Obp83b protein in Drosophila using the divergences and times estimated in this practice. Think carefully about which divergence to use for this calculation and apply the formula.


Final assignment

Table 2. Identifiers and genomic coordinates of the rpL32 gene in different Drosophila species.

Species ID Coordinates in the genome Strand
D. melanogaster NT_033777 30045229-30046161 2
D. simulans NC_052523 26165102-26166121 2
D. erecta NW_020825194 2228933..2229932 1
D. serrata NW_018367383 6093667..6094458 2
D. pseudopobscura NC_046679 811678-812625 1


Using the data in Table 2, calculate the rate of evolution of the protein RpL32 in Drosophila and answer the following questions. Note that for this gene D. elegans is replaced by D. serrata.

5. Which is the evolutionary rate of the RpL32 protein?

Answer:

6. Are the evolutionary rates of the two proteins studied in this practice different?

Answer:

7. Can you provide a molecular evolutionary explanation for the answer to question 6?

Answer:

8. Which of the other positions examined here (i.e. nonsynonymous, noncoding, the full cds…) could have been used to reconstruct the phylogenetic tree prior to calibration?

Answer:


9. Looking at your results and the tree in Figure 2, do you think that the fact that some of the species are not the same in both analyses (D. elegans in obp83b and D. serrata in rpL32) influences the results? Discuss

Answer:


Deliver info

Submit this document with your answers (or just a document with your answers to the questions in any readable format, e.g. word, pdf, plain text…), and the R code used to complete the final assignment to AULAESCI

Deadline: June 28, 2024

Doubts?